Human breast tumours are diverse in their natural history and in
their responsiveness to treatments. Variation in transcriptional
programs accounts for much of the biological diversity of human 
cells and tumours. In each cell, signal transduction and regulatory 
systems transduce information from the cell's identity to its 
environmental status, thereby controlling the level of expression 
of every gene in the genome. Here we have characterized variation 
in gene expression patterns in a set of 65 surgical specimens of 
human breast tumours from 42 different individuals, using 
complementary DNA microarrays representing 8,102 human 
genes. These patterns provided a distinctive molecular portrait 
of each tumour. Twenty of the tumours were sampled twice, 
before and after a 16-week course of doxorubicin chemotherapy,
and two tumours were paired with a lymph node metastasis from 
the same patient. Gene expression patterns in two tumour 
samples from the same individual were almost always more 
similar to each other than either was to any other sample. Sets 
of co-expressed genes were identified for which variation in 
messenger RNA levels could be related to specific features of 
physiological variation. The tumours could be classified into 
subtypes distinguished by pervasive differences in their gene 
expression patterns. 

We proposed that the phenotypic diversity of breast tumours 
might be accompanied by a corresponding diversity in gene expres- 
sion patterns that we could capture using cDNA microarrays. 
Systematic investigation of gene expression patterns in human 
breast tumours might then provide the basis for an improved 
molecular taxonomy of breast cancers. We analysed gene expression 
patterns in grossly dissected normal or malignant human breast 
tissues from 42 individuals (36 infiltrating ductal carcinomas, 2 
lobular carcinomas, 1 ductal carcinoma in situ, 1 fibroadenoma and 
3 normal breast samples). Fluorescently labelled (Cy5) cDNA was 
prepared from mRNA from each experimental sample. We prepared 
cDNA, labelled using a second distinguishable fluorescent nucleotide
(Cy3), from a pool of mRNAs isolated from 11 different
cultured cell lines (see Supplementary Information Table 1); this
common 'reference' sample provided an internal standard against
which the gene expression of each experimental sample was
compared.

Nature, vol. 406, p. 747 (letters to nature). © 2000 Macmillan Magazines Ltd

Twenty of the forty breast tumours examined were sampled twice,
as part of a larger study on locally advanced breast cancers (T3/T4 
and/or N2 tumours; see ref. 4). After an open surgical biopsy to 
obtain the 'before' sample, each of these patients was treated with 
doxorubicin for an average of 16 weeks (range 12-23), followed by 
resection of the remaining tumour. In addition, primary tumours
from two patients were also paired with a lymph node metastasis
from the same patient. To help interpret the variation in expression
patterns seen in the tumour samples, we also characterized 17
cultured cell lines (with one cell line cultured under three different
conditions), which provided models for many of the cell types
encountered in these tissue samples. In total, we analysed 84 cDNA
microarray experiments (see Supplementary Information, Table 2;
the primary data tables can be obtained at
http://genome-www.stanford.edu/molecularportraits/).

Figure 1 Variation in expression of 1,753 genes in 84 experimental samples. Data are
presented in a matrix format: each row represents a single gene, and each column an
experimental sample. In each sample, the ratio of the abundance of transcripts of each
gene to the median abundance of the gene's transcript among all the cell lines (left panel),
or to its median abundance across all tissue samples (right panel), is represented by the
colour of the corresponding cell in the matrix. Green squares, transcript levels below the
median; black squares, transcript levels equal to the median; red squares, transcript
levels greater than the median; grey squares, technically inadequate or missing data.
Colour saturation reflects the magnitude of the ratio relative to the median for each set of
samples (see scale, bottom left; and Supplementary Information Fig. 4). a, Dendrogram
representing similarities in the expression patterns between experimental samples. All
'before and after' chemotherapy pairs that were clustered on terminal branches are
highlighted in red; the two primary tumour/lymph node metastasis pairs in light blue; the
three clustered normal breast samples in light green. Branches representing the four
breast luminal epithelial cell lines are shown in dark blue; breast basal epithelial cell lines
in orange, the endothelial cell lines in dark yellow, the mesenchymal-like cell lines in dark
green, and the lymphocyte-derived cell lines in brown. b, Scaled-down representation of
the 1,753-gene cluster diagram; coloured bars to the right identify the locations of the
inserts displayed in c-j. c, Endothelial cell gene expression cluster; d, stromal/fibroblast
cluster; e, breast basal epithelial cluster; f, B-cell cluster; g, adipose-enriched/normal
breast; h, macrophage; i, T-cell; j, breast luminal epithelial cell.

A hierarchical clustering method was used to group genes on the 
basis of similarity in the pattern with which their expression varied 
over all samples. The same clustering method was used to group the
experimental samples (cell lines and tissues separately) on the basis 
of similarity in their patterns of expression. We focus first on a set of 
1,753 genes (about 22% of the 8,102 genes analysed), whose 
transcripts varied in abundance by at least fourfold from their 
median abundance in this sample set in at least three of the samples 
(Fig. 1; see Supplementary Information Fig. 4 for the complete 
cluster diagram). 
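The gene-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the toy gene names and the choice to work on log2 ratios are assumptions, and "at least fourfold from the median" is read here as a deviation in either direction.

```python
import math
from statistics import median

def select_variable_genes(ratios, fold=4.0, min_samples=3):
    """Return genes whose ratio differs from the gene's median by at least
    `fold` (up or down) in at least `min_samples` samples.

    `ratios` maps a gene name to a list of expression ratios
    (sample / reference); None marks a missing measurement.
    """
    threshold = math.log2(fold)  # fourfold corresponds to 2 log2 units
    selected = []
    for gene, values in ratios.items():
        logs = [math.log2(v) for v in values if v is not None and v > 0]
        if len(logs) < min_samples:
            continue
        centre = median(logs)
        hits = sum(1 for x in logs if abs(x - centre) >= threshold)
        if hits >= min_samples:
            selected.append(gene)
    return selected

# Hypothetical toy data: only GENE_A deviates fourfold in three samples.
example = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0, 8.0, 0.125, 16.0],
    "GENE_B": [1.0, 1.2, 0.9, 1.1, 1.0, 4.5, 1.0],
    "GENE_C": [2.0, None, 2.0, 2.0, 2.0, 2.0, 2.0],
}
print(select_variable_genes(example))
```

Working in log space makes a fourfold increase and a fourfold decrease symmetric around the median, which is why the threshold is applied to the absolute log2 difference.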

Three striking features of the gene expression patterns of these 
tumours are evident in Fig. 1. First, the tumours show great 
variation in their patterns of gene expression. Second, this variation
is multidimensional; that is, many different sets of genes show 
mainly independent patterns of variation. Third, these patterns 
have a pervasive order reflecting relationships among the genes, 
relationships among the tumours and connections between specific 
genes and specific tumours. 

The hierarchical clustering algorithm organizes the experimental 
samples only on the basis of overall similarity in their gene 
expression patterns; these relationships are summarized in a dendrogram
(Fig. 1a), in which the pattern and length of the branches
reflect the relatedness of the samples. Fifteen of the twenty before
and after doxorubicin pairs (red dendrogram branches), and both
primary tumour/lymph node metastasis pairs (light blue branches)
were clustered together on terminal branches in the dendrogram;
that is, despite an interval of 16 weeks, independent surgical
procedures and cytotoxic chemotherapy, independent samples 
taken from the same tumour were in most cases recognizably 
more similar to each other than either was to any of the other 
samples. In three instances (Norway 47, 61 and 101), the 'after' 
chemotherapy specimens clustered in a branch of the dendrogram 
that also contained the three normal breast samples; we know from 
the clinical data that these tumours were 3 of the 20 tumours that 
were classified as doxorubicin 'responders' (data not shown). An 
analysis of the relationship between gene expression and correlations
with clinical data will be reported elsewhere (T.S. et al.,
manuscript in preparation).
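One simple way to check whether paired samples resemble each other more than they resemble any other sample is to test whether the two members of a pair are mutual nearest neighbours. The sketch below assumes Pearson correlation as the similarity measure and uses hypothetical sample names and values; it is a proxy for clustering on a terminal branch, not a reimplementation of the clustering software used in the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def nearest(sample, profiles):
    """Name of the other sample most correlated with `sample`."""
    others = (s for s in profiles if s != sample)
    return max(others, key=lambda s: pearson(profiles[sample], profiles[s]))

def pairs_recovered(profiles, pairs):
    """Pairs whose two members are each other's nearest neighbour."""
    return [(a, b) for a, b in pairs
            if nearest(a, profiles) == b and nearest(b, profiles) == a]

# Hypothetical log-ratio profiles over four genes.
profiles = {
    "T1_before": [2.0, -1.0, 0.5, 1.5],
    "T1_after":  [1.8, -0.8, 0.4, 1.6],
    "T2_before": [-1.0, 2.0, 1.0, -0.5],
    "T2_after":  [-0.9, 1.7, 1.2, -0.2],
    "T3_before": [0.1, 0.4, -2.0, 0.3],
}
pairs = [("T1_before", "T1_after"), ("T2_before", "T2_after")]
print(pairs_recovered(profiles, pairs))
```

With average-linkage agglomerative clustering, two samples that are mutual nearest neighbours are joined to each other before either joins anything else, which is why this check approximates the "clustered on terminal branches" criterion.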

The 'molecular portraits' revealed in the patterns of gene expres- 
sion not only uncovered similarities and differences among the 
tumours, but in many cases pointed to a biological interpretation. 
Variation in growth rate, in the activity of specific signalling path- 
ways, and in the cellular composition of the tumours were all 
reflected in the corresponding variation in the expression of specific 
subsets of genes. The largest distinct cluster of genes within the 
1,753-gene cluster diagram was the 'proliferation cluster' (Supplementary
Information Fig. 5), which is a group of genes whose levels
of expression correlate with cellular proliferation rates. Expression
of this cluster of genes varied widely among the tumour samples,
and was generally well correlated with the mitotic index. As one 
might expect, this cluster also included the genes encoding two 
widely used immunohistochemical markers of cell proliferation 
(Ki-67 and PCNA). 

Several groups of co-expressed genes provided views of the 
activities of specific signalling and/or regulatory systems. A large 
cluster of genes regulated by the interferon pathway (including 
STAT1) showed substantial variation in expression among the
tumours, as was previously observed in a smaller set of breast
tumours. Variation in expression of the oestrogen receptor-α gene
(ER) correlated well with the direct clinical measurement of the ER
protein levels in the tumours (Supplementary Information Table 3;
concordance in 36/38 samples), and paralleled variation in the
expression of a larger group of genes that included three other
transcription factors (GATA-binding protein 3 (refs 7, 8), X-box
binding protein 1 and hepatocyte nuclear factor 3α). HER2/neu,
also known as Erb-B2, is overexpressed in 20-30% of all breast
tumours, usually associated with DNA amplification of the Erb-B2
locus. Notably, most of the other genes contained within the
Erb-B2 cluster were located in this same region of chromosome 17,
and were also amplified on the genomic DNA level (ref. 10; and
J.R.P., unpublished data). Finally, a cluster of genes that included
c-Fos and JunB co-varied in expression among the tumour specimens.
We have found that this subset of genes is characteristically induced
by prolonged handling of the samples after surgical resection
(M.v.d.R. and C.M.P., unpublished data).

Figure 2 Breast tissue immunohistochemistry. a, Normal mammary duct using antibodies
against the basal keratins 5/6. b, Normal mammary duct using antibodies against the
luminal keratins 8/18 (adjacent tissue sections were used in a and b). c, Tumour
Stanford 16 using antibodies against keratins 8/18. d, Tumour New York 3 using
antibodies against keratins 5/6.

Figure 3 Cluster analysis using the 'intrinsic' gene subset. Two large branches were apparent in the dendrogram, and within these large branches were smaller branches for which
common biological themes could be inferred. Branches are coloured accordingly: basal-like, orange; Erb-B2+, pink; normal-breast-like, light green; and luminal epithelial/ER+, dark
blue. a, Experimental sample associated cluster dendrogram. Small black bars beneath the dendrogram identify the 17 pairs that were matched by this hierarchical clustering; larger
green bars identify the positions of the three pairs that were not matched by the clustering. b, Scaled-down representation of the intrinsic cluster diagram (see Supplementary
Information Fig. 6). c, Luminal epithelial/ER gene cluster. d, Erb-B2 overexpression cluster. e, Basal epithelial cell associated cluster containing keratins 5 and 17. f, A second basal
epithelial-cell-enriched gene cluster.

Human breast tumours are histologically complex tissues, containing
a variety of cell types in addition to the carcinoma cells. In
analysing the gene expression patterns in solid human tumours, we 
used two lines of reasoning to infer the lineage of the cells that 
accounted for the apparently cell-type-specific expression of par- 
ticular clustered groups of genes. First, such clusters included genes 
whose expression patterns have been previously characterized and 
that consistently pointed to a specific cell type. Second, these 
inferences were often corroborated by comparable expression of 
the same cluster in one or more of the cultured cell lines. Thus, eight 
independent clusters of genes appeared to reflect variation in 
specific cell types present within the tumours (Fig. 1c-j).

(1) Endothelial cells: a cluster of genes characteristically expressed
by endothelial cells, including CD34, CD31 and von Willebrand
factor, was also strongly expressed in the two endothelial cell lines
HUVEC and HMVEC (Fig. 1c). (2) Stromal cells: a previously
characterized cluster of genes that included several isoforms of
collagen showed significant variation in expression among samples
(Fig. 1d). (3) Adipose-enriched/normal breast cells: a cluster of
genes including fatty-acid binding protein 4 and PPARγ may
represent the presence of adipose cells (Fig. 1g). (4) B lymphocytes:
variation in expression of a cluster of genes that were highly
expressed in the multiple myeloma-derived cell line RPMI-8226,
including many immunoglobulin genes, appears to represent variable
B-cell infiltration (Fig. 1f). (5) T lymphocytes: a cluster of genes
including CD3δ and two subunits of the T-cell receptor was highly
expressed in the T-cell leukaemia-derived cell line MOLT-4 and
probably indicates T-cell infiltrates (Fig. 1i). (6) Macrophages: a
cluster of genes that appeared to be markers of macrophage/
monocytes included CD68, acid phosphatase 5, chitinase and
lysozyme (Fig. 1h).

Two distinct types of epithelial cell are found in the human 
mammary gland: basal (and/or myoepithelial) cells and luminal 
epithelial cells. These two cell types are conveniently distinguished
immunohistochemically; basal epithelial cells can be
stained with antibodies to keratin 5/6 (Fig. 2a), whereas luminal
epithelial cells stain with antibodies against keratins 8/18 (Fig. 2b).
Many genes were expressed by one of these two cell lineages, but not
by the other (Fig. 1e and j). The gene expression cluster characteristic
of basal epithelial cells included keratin 5, keratin 17, integrin-β4
and laminin (Fig. 1e). The gene expression cluster characteristic
of the luminal cells was anchored by the previously noted
cluster of transcription factors that included ER (Fig. 1j).

One goal of this study was to develop a system for classifying 
tumours on the basis of their gene expression patterns. The subset of 
genes shown in Fig. 1 was not necessarily optimal for this purpose, 
as the choice of genes whose expression levels provided the basis for 
the ordering of the tumour samples determined which phenotypic 
relationships among the tumours were reflected in the clustering 
patterns. We therefore selected an alternative subset of genes to use 
as the basis for a new clustering analysis. 

The rationale behind this alternative gene subset was that specific 
features of a gene expression pattern that are to be used to classify 
tumours should be similar in any sample taken from the same 
tumour, and they should vary among different tumours. The 22 
paired samples provided a unique opportunity for a deliberate and 
systematic search for such genes. From the genes whose expression 
was well measured in the 65 tissue samples, we selected a subset of 
496 genes (termed the 'intrinsic' gene subset) that consisted of genes 
with significantly greater variation in expression between different 
tumours than between paired samples from the same tumour (see
Supplementary Information). When variation in expression of this 
set of genes was used to order the tissue samples (Fig. 3; and 
Supplementary Information Fig. 6), 17 of the 20 'before and after' 
doxorubicin pairs were grouped together as were both of the tumour/ 
lymph node metastasis pairs. Qualitatively similar sample clustering
patterns were obtained when a second gene subset that focused on 
genes expressed by epithelial cell types, and which had only 25% 
overlap with the intrinsic gene subset, was used (data not shown). 
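The selection criterion for the 'intrinsic' gene subset can be illustrated with a simple score: variation between tumours divided by variation within paired samples from the same tumour. The exact statistic is described in the Supplementary Information and is not reproduced here; the ratio below, the function names and the toy values are assumptions made only to show the idea.

```python
from statistics import mean, pvariance

def intrinsic_score(pairs):
    """Between-tumour variance divided by within-pair variance for one gene.

    `pairs` holds one (sample_1, sample_2) tuple of log ratios per tumour.
    This ratio is an illustrative assumption, not the published statistic.
    """
    tumour_means = [(a + b) / 2 for a, b in pairs]
    between = pvariance(tumour_means)
    # Sample variance of two values a and b is (a - b)^2 / 2.
    within = mean((a - b) ** 2 / 2 for a, b in pairs)
    return between / within if within > 0 else float("inf")

def rank_intrinsic(genes, top_n):
    """Return the `top_n` gene names with the highest intrinsic score."""
    ordered = sorted(genes, key=lambda g: intrinsic_score(genes[g]), reverse=True)
    return ordered[:top_n]

# Hypothetical genes measured in four paired tumours.
genes = {
    "STABLE_VARIABLE": [(2.0, 2.1), (-1.5, -1.4), (0.2, 0.1), (1.0, 1.2)],
    "NOISY": [(0.5, -0.6), (-0.4, 0.7), (0.3, -0.5), (-0.2, 0.6)],
}
print(rank_intrinsic(genes, top_n=1))
```

A gene scores highly only if it is reproducible within a tumour and different across tumours, which is the property the text asks of a classifying gene.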

The division of the tissue samples into two subgroups was a 
striking feature of the intrinsic gene subset cluster analysis (Fig. 3a).
As a test of the robustness of this division, we applied the 'weighted
voting' method. This algorithm recapitulated the sorting of the
tissue samples between these two subgroups for all but 1 of the 65 
samples (data not shown). It is important to note, however, that there 
is extensive residual variation in expression patterns within each of 
these two broad subgroups. Indeed, many of the finer subdivisions 
probably have important biological properties (see below). 
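For readers unfamiliar with weighted voting, the following is a minimal sketch of the commonly described form of the method, in which each gene votes with a signal-to-noise weight relative to the midpoint between the two class means. Whether the analysis here used exactly this formulation is an assumption; the class labels, gene names and values are hypothetical.

```python
from statistics import mean, stdev

def train_weighted_voting(class_a, class_b):
    """Build per-gene (weight, boundary) pairs from two training classes.

    Each argument maps a gene name to the expression values of that gene
    in the training samples of one class.
    """
    model = {}
    for gene in class_a:
        mu_a, mu_b = mean(class_a[gene]), mean(class_b[gene])
        sd_a, sd_b = stdev(class_a[gene]), stdev(class_b[gene])
        weight = (mu_a - mu_b) / (sd_a + sd_b)  # signal-to-noise weight
        boundary = (mu_a + mu_b) / 2            # midpoint between class means
        model[gene] = (weight, boundary)
    return model

def predict(model, sample):
    """Sum the weighted votes; a positive total favours class A."""
    total = sum(w * (sample[g] - b) for g, (w, b) in model.items())
    return "A" if total > 0 else "B"

# Hypothetical training data for two subgroups and two genes.
class_a = {"G1": [2.0, 2.4, 1.8], "G2": [-1.0, -1.3, -0.8]}
class_b = {"G1": [-0.5, -0.2, -0.9], "G2": [1.1, 0.9, 1.4]}
model = train_weighted_voting(class_a, class_b)
print(predict(model, {"G1": 1.9, "G2": -0.7}))
print(predict(model, {"G1": -0.4, "G2": 1.0}))
```

A robustness test of the kind described would typically hold each sample out in turn, train on the rest, and compare the predicted subgroup with the one assigned by clustering.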

The two dendrogram branches in Fig. 3 largely separate the 
tumour samples into those that were clinically described as ER
positive (blue) and those that were ER negative (other colours). The
tumours in the ER+ group were characterized by the relatively high
expression of many genes expressed by breast luminal cells (Fig. 3c).
This connection was further corroborated using immunohistochemical
analysis and antibodies against the luminal cell keratins
8/18 (Fig. 2c). With one exception, none of the tumours in this 
group expressed Erb-B2 at high levels (Fig. 3d). 

Many of the genes characteristic of breast basal epithelial cells 
were also highly expressed in a group of six clustered tumours 
(Fig. 3e). To corroborate the 'basal-like' characteristics of these 
tumours, we carried out immunohistochemistry using antibodies 
against the breast basal cell keratins 5/6 and 17. All six of these 
tumours showed staining for either keratins 5/6 or 17 or both 
(Fig. 2d). Notably, these six tumours also failed to express ER and
most of the other genes that were usually co-expressed with it
(Fig. 3c). Breast tumours that stain positive for basal keratins have
been described, and such tumours may account for 3-15% of all
breast tumours; in this study, the incidence was 15% (6/40).

As mentioned above, overexpression of the Erb-B2 oncogene was 
associated with the high expression of a specific subset of genes. We 
identified a cluster of tumours that was partially characterized by 
the high level of expression of this subset of genes (Fig. 3d). These 
tumours also showed low levels of expression of ER and of
almost all of the other genes associated with ER expression, a trait
they share with the basal-like tumours.

Several tumour samples and the single fibroadenoma tested 
(Fig. 3, light green), were clustered with a group of samples that 
also contained the three normal breast specimens (Fig. 3a). The 
'normal breast' gene expression pattern is typified by the high 
expression of genes characteristic of basal epithelial cells and 
adipose cells, and the low expression of genes characteristic of 
luminal epithelial cells. 

The number of clearly different molecular phenotypes observed 
among the breast tumours suggests that we are far from having a 
complete picture of the diversity of breast tumours. When hundreds 
(instead of tens) of breast tumours have been characterized, a more 
defined tumour classification is likely, and statistically significant 
relationships with clinical parameters should be uncovered. We 
were, however, able to identify four groups of samples that might 
be related to different molecular features of mammary epithelial 
biology (that is, ER+/luminal-like, basal-like, Erb-B2+ and normal
breast). An important implication of this study is that the clinical
designation of 'oestrogen receptor negative' breast carcinoma
encompasses at least two biologically distinct subtypes of tumours
(basal-like and Erb-B2 positive), which may need to be treated as
distinct diseases. 

A striking conclusion from these data concerns the stability, 
homogeneity and uniqueness of the 'molecular portraits' provided
by the quantitative analysis of gene expression patterns. We infer 
that these portraits faithfully represent the 'tumour' itself, and not 
merely the particular tumour 'sample', because we could recognize 
the distinctive expression pattern of a tumour in independent 
samples. The finding that a metastasis and primary tumour were 
as similar in their overall pattern of gene expression as were repeated 
samplings of the same primary tumour, suggests that the molecular 
program of a primary tumour may generally be retained in its 
metastases. Finally, we have explicitly discussed only a tiny fraction 
of the genes whose expression patterns varied among these 
tumours. Attention to the thousands of individual genes that 
define the molecular portraits of each tumour, and learning to 
interpret their patterns of variation, will undoubtedly lead to a
deeper and more complete understanding of breast cancers.
Sample damage by X-rays and other radiation limits the resolution
of structural studies on non-repetitive and non-reproducible
structures such as individual biomolecules or cells. Cooling can
slow sample deterioration, but cannot eliminate damage-induced
sample movement during the time needed for conventional
measurements. Analyses of the dynamics of damage formation
suggest that the conventional damage barrier (about 200
X-ray photons per Å² with X-rays of 12 keV energy or 1 Å
wavelength) may be extended at very high dose rates and very
short exposure times. Here we have used computer simulations to 
investigate the structural information that can be recovered from 
the scattering of intense femtosecond X-ray pulses by single 
protein molecules and small assemblies. Estimations of radiation 
damage as a function of photon energy, pulse length, integrated 
pulse intensity and sample size show that experiments using very 
high X-ray dose rates and ultrashort exposures may provide useful 
structural information before radiation damage destroys the 
sample. We predict that such ultrashort, high-intensity X-ray 
pulses from free-electron lasers that are currently under development,
in combination with container-free sample handling
methods based on spraying techniques, will provide a new 
approach to structural determinations with X-rays. 
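The text treats 12 keV and a wavelength of about 1 Å as equivalent descriptions of the same X-rays. A short sketch of the conversion, using E = hc/λ with hc ≈ 12.398 keV·Å, shows why; the function names are illustrative.

```python
HC_KEV_ANGSTROM = 12.39842  # Planck constant times speed of light, in keV·Å

def photon_energy_kev(wavelength_angstrom):
    """Photon energy in keV for a wavelength given in ångströms."""
    return HC_KEV_ANGSTROM / wavelength_angstrom

def photon_wavelength_angstrom(energy_kev):
    """Photon wavelength in ångströms for an energy given in keV."""
    return HC_KEV_ANGSTROM / energy_kev

print(round(photon_energy_kev(1.0), 2))            # energy at 1 Å
print(round(photon_wavelength_angstrom(12.0), 3))  # wavelength at 12 keV
```

A 1 Å photon therefore carries roughly 12.4 keV, and a 12 keV photon has a wavelength slightly above 1 Å, so the two figures quoted in the text are rounded equivalents.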

Radiation damage is caused by X-ray photons depositing energy 
directly into the sample. At 1 Å wavelength, the photoelectric
cross-section of carbon is about 10 times higher than its elastic-scattering
cross-section, making the photoelectric effect the primary source of
damage. The photoelectric effect is a resonance phenomenon in
which a photon is absorbed and an electron ejected, usually from a
low-lying orbital of the atom (about 95% of the photoelectric events 
remove K-shell electrons from carbon, nitrogen, oxygen and sul- 
phur), producing a hollow ion with an unstable electronic config- 
uration. Relaxation is achieved through an electron from a higher 
shell falling into the vacant orbital. In heavy elements this usually 
gives rise to X-ray fluorescence, whereas in light elements the falling 
electron is more likely to give up its energy to another electron, 
which is then ejected in the Auger effect. Auger emission is predominant
in carbon, nitrogen, oxygen and sulphur (> 95%); thus,
most photoelectric events ultimately remove two electrons from
these elements. These two electrons have different energies
(~12 keV for photoelectrons and ~0.25 keV for Auger electrons),